Few-Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes from only a few support examples. In this work, we explore a simple yet unified solution for FSIS and its incremental variants, and introduce a new framework named Reference Twice (RefT) that fully exploits the relationship between support and query features within a Transformer-like framework. Our key insights are twofold. First, with the aid of support masks, we can generate dynamic class centers that re-weight query features more appropriately. Second, we find that support object queries already encode key factors after base training. The query features can therefore be enhanced twice, at the feature level and at the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then link object queries via cross-attention for better calibration. With these two steps, performance on novel classes improves significantly over our strong baseline. Our framework can also be extended to incremental FSIS with minor modifications. On the COCO dataset under the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots, e.g., it improves nAP by +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few-Shot Object Detection. Code and models will be made available.
translated by 谷歌翻译
Current domain adaptation methods for face anti-spoofing leverage labeled source domain data and unlabeled target domain data to obtain a promising generalizable decision boundary. However, it is usually difficult for these methods to achieve a perfect domain-invariant disentanglement of liveness features, so domain differences in illumination, face category, spoof type, etc. may degrade the final classification performance. In this work, we tackle cross-scenario face anti-spoofing by proposing a novel domain adaptation method called cyclically disentangled feature translation network (CDFTN). Specifically, CDFTN generates pseudo-labeled samples that possess: 1) source domain-invariant liveness features and 2) target domain-specific content features, which are disentangled through domain adversarial training. A robust classifier is then trained on the synthetic pseudo-labeled images under the supervision of source domain labels. We further extend CDFTN to multi-target domain adaptation by leveraging data from additional unlabeled target domains. Extensive experiments on several public datasets demonstrate that our proposed approach significantly outperforms the state of the art.
The problem of covariate-shift generalization has attracted intensive research attention. Previous stable learning algorithms employ sample reweighting schemes to decorrelate the covariates when there is no explicit domain information about the training data. However, with finite samples, it is difficult to obtain weights that ensure perfect independence and thereby eliminate the unstable variables. Moreover, decorrelating within the stable variables may cause high variance in the learned models because it over-reduces the effective sample size, so these algorithms require a very large sample size to work. In this paper, with theoretical justification, we propose SVI (Sparse Variable Independence) for the covariate-shift generalization problem. We introduce a sparsity constraint to compensate for the imperfection of sample reweighting under the finite-sample setting in previous methods. Furthermore, we combine independence-based sample reweighting and sparsity-based variable selection in an iterative way to avoid decorrelating within stable variables, which increases the effective sample size and alleviates variance inflation. Experiments on both synthetic and real-world datasets demonstrate the improvement in covariate-shift generalization performance brought by SVI.
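To make the notion of a reduced effective sample size concrete, the following minimal sketch computes the standard Kish effective sample size, (sum of weights)^2 / (sum of squared weights), for a set of sample weights. It is an illustration only and not the SVI algorithm; the function name and the example weights are assumptions introduced here.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2).

    Illustrative helper (hypothetical name), not part of SVI itself.
    """
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0:
        raise ValueError("at least one weight must be non-zero")
    return total * total / total_sq


# Uniform weights: every sample counts equally.
uniform = [1.0] * 100

# Skewed weights (made-up values): a few samples dominate,
# as can happen after aggressive reweighting.
skewed = [10.0] * 10 + [0.1] * 90

ess_uniform = effective_sample_size(uniform)
ess_skewed = effective_sample_size(skewed)
print(ess_uniform, ess_skewed)
```

With uniform weights the effective sample size equals the number of samples, while the skewed weights leave far fewer effective samples, which is the variance-inflation concern the abstract raises.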
This work is concerned with efficiently solving for neural network-based feedback controllers in optimal control problems. We first conduct a comparative study of two mainstream approaches: offline supervised learning and online direct policy optimization. Although the training part of the supervised learning approach is relatively easy, the success of the method depends heavily on the optimal control dataset generated by open-loop optimal control solvers. In contrast, direct optimization turns the optimal control problem into an optimization problem directly, without any pre-computation, but the dynamics-related objective can be hard to optimize when the problem is complicated. Our results highlight the advantage of offline supervised learning in terms of both optimality and training time. To overcome the main challenges of the two approaches, namely the dataset and the optimization respectively, we combine them and propose the Pre-train and Fine-tune strategy as a unified training paradigm for optimal feedback control, which further improves performance and robustness significantly. Our code is available at https://github.com/yzhao98/DeepOptimalControl.
When reading a story, humans can rapidly understand new fictional characters from a few observations, mainly by drawing analogies to fictional and real people they have met before. This reflects the few-shot and meta-learning essence of humans' inference of characters' mental states, i.e., humans' theory-of-mind (ToM), which is largely ignored in existing research. We fill this gap with a novel NLP benchmark, TOM-IN-AMC, the first assessment of models' ability to meta-learn ToM in a realistic narrative understanding scenario. Our benchmark consists of $\sim$1,000 parsed movie scripts, each corresponding to a few-shot character understanding task, and requires models to mimic humans' ability to quickly grasp characters from a few opening scenes of a new movie. Our human study verified that humans can solve our problem by inferring characters' mental states based on their previously seen movies, while the state-of-the-art metric-learning and meta-learning approaches adapted to our task lag 30% behind.
Frozen pretrained models have become a viable alternative to the pretraining-then-finetuning paradigm for transfer learning. However, with frozen models there are relatively few parameters available for adapting to downstream tasks, which is problematic in computer vision where tasks vary significantly in input/output format and the type of information that is of value. In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation and video action recognition. From this empirical analysis, our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. With this work, we hope to bring greater attention to this promising path of freezing pretrained image models.
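Spiking neural networks (SNNs) are biologically plausible models that offer high computational capability and low power consumption. However, training deep SNNs remains an open problem, which limits their real-world applications. Here we propose a deep SNN architecture named Spiking SiamFC++ for object tracking, trained directly end to end. Specifically, the AlexNet network is extended in the time domain to extract features, and a surrogate gradient function is adopted to enable direct supervised training of the deep SNN. To examine the performance of Spiking SiamFC++, several tracking benchmarks are considered, including OTB2013, OTB2015, VOT2015, VOT2016, and UAV123. The precision loss is found to be small compared with the original SiamFC++. Compared with existing SNN-based target trackers, e.g., SiamSNN, the precision (success) of the proposed Spiking SiamFC++ reaches 85.24% (64.37%), which is much higher than the 52.78% (44.32%) achieved by SiamSNN. To the best of our knowledge, Spiking SiamFC++ outperforms the existing state-of-the-art approaches in SNN-based object tracking, which provides a new path for SNN applications in the field of target tracking. This work may further promote the development of SNN algorithms and neuromorphic chips.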
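As a rough illustration of the surrogate gradient idea mentioned above, the sketch below uses a hard threshold in the forward pass and the derivative of a sigmoid in the backward pass. The abstract does not say which surrogate function Spiking SiamFC++ uses, so the sigmoid choice, the function names, and the `threshold` and `slope` values are all assumptions made here for illustration.

```python
import math


def spike_forward(v, threshold=1.0):
    """Forward pass: emit a spike (1.0) when the membrane potential
    reaches the threshold. This step function has zero gradient
    almost everywhere, so it cannot be trained through directly."""
    return 1.0 if v >= threshold else 0.0


def spike_surrogate_grad(v, threshold=1.0, slope=4.0):
    """Backward pass: replace the step function's derivative with the
    derivative of sigmoid(slope * (v - threshold)).
    Assumed surrogate; other shapes (e.g. piecewise linear) are common."""
    s = 1.0 / (1.0 + math.exp(-slope * (v - threshold)))
    return slope * s * (1.0 - s)


# The surrogate gradient is largest at the threshold and decays away
# from it, so neurons near firing receive the strongest learning signal.
for v in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(v, spike_forward(v), round(spike_surrogate_grad(v), 4))
```

In an autograd framework the same pairing would typically be expressed as a custom function with separate forward and backward rules; the plain-Python version here only shows the two formulas side by side.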
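Electric vehicles (EVs) play a critical role in autonomous mobility-on-demand (AMoD) systems, but their unique charging patterns increase the model uncertainties in AMoD systems (e.g., in the state transition probabilities). Because there is usually a mismatch between the training and test (true) environments, incorporating model uncertainty into system design is crucial. However, model uncertainties have not yet been considered explicitly in the existing literature on EV AMoD system rebalancing, and doing so remains an urgent and challenging task. In this work, we design a robust and constrained multi-agent reinforcement learning (MARL) framework for the EV rebalancing and charging problem. We then propose a robust and constrained MARL algorithm (ROCOMA) that trains a robust EV rebalancing policy to balance the supply-demand ratio and the charging utilization rate across the whole city under state transition uncertainty. Experiments show that ROCOMA can learn an effective and robust rebalancing policy. It outperforms non-robust MARL methods when model uncertainties are present. It increases system fairness by 19.6% and decreases rebalancing costs by 75.8%.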
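Text-based games (TBGs) are complex environments that allow users or computer agents to interact through text and achieve game goals. Building goal-oriented computer agents for text-based games is challenging, especially when step-by-step feedback is the only text input to the model. Moreover, it is hard for agents to produce inputs of flexible length and form by evaluating candidates from a much larger text input space. In this paper, we provide an extensive analysis of deep learning methods applied to the field of text-based games.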
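Both visual and auditory information are valuable for determining the salient regions in videos. Deep convolutional neural networks (CNNs) have shown a strong capacity for the audio-visual saliency prediction task. Due to various factors such as shooting scenes and weather, there is often a moderate distribution discrepancy between the source training data and the target testing data. This domain discrepancy degrades the performance of CNN models on the target testing data. This paper makes an early attempt to tackle the unsupervised domain adaptation problem for audio-visual saliency prediction. We propose a dual domain-adversarial learning algorithm to mitigate the domain discrepancy between source and target data. First, a dedicated domain discrimination branch is built to align the auditory feature distributions. Then, these auditory features are fused into the visual features through a cross-modal self-attention module. A second domain discrimination branch is designed to reduce the domain discrepancy of the visual features and of the audio-visual correlations implied by the fused audio-visual features. Experiments on public benchmarks demonstrate that our method can relieve the performance degradation caused by domain discrepancy.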